1 Background

1.1 Aim

From cumulative to incident case counts in the PAHO PLISA Health Information Platform for the Americas dengue database

1.2 Objectives

  1. Clean, process and transform 53 downloaded PAHO weekly cumulative datasets for all countries into one dataframe.
  2. Summarise the data coverage for each country (admin 0 level) in the cumulative “total of dengue cases” table.
  3. Summarise the data coverage for each country (admin 0 level) in the incident “suspected dengue cases” (Non-severe cases by epidemiological week) table.
  4. Investigate how the data coverage varies according to burden of dengue disease.
  5. Develop methods to convert cumulative to incident case counts, accounting for gaps in data coverage.

1.3 Rationale

It is important to note that the cumulative datasets are for reports of “all dengue cases: suspected, probable, confirmed, non-severe and severe cases, and deaths. Probable dengue cases: a person who has a fever or history of fever of 2-7 days’ duration, two or more symptoms of dengue, and either a positive serological test or an epidemiological nexus with a confirmed dengue case 14 days before onset of symptoms”.

Although this dataset has “select epidemiological week” slider options for all available epidemiological weeks of every year, the epidemiological week for which information is actually available/reported can differ from that on the slider. Some countries have missing weeks of data, and the extent of missing data varies greatly between countries and over time. Moving from cumulative to incident weekly counts is straightforward for countries with 100% reporting of sequential cumulative counts, by subtracting the previous week’s count from the following week’s count, but this is not so for countries with missing data. Incident data are already available through the “Chart” and “Epi Curve” options; however, these cover non-severe cases of dengue only. This is not the same as the “total of dengue cases” measured in the cumulative table, which is our priority for this dataset.
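For a country with complete reporting, this conversion is a simple first difference. A minimal base-R sketch with made-up counts:

```r
# Toy cumulative series for one country-year (hypothetical values)
cum_counts <- c(10, 25, 25, 40, 70)

# Incident count for week t = cumulative(t) - cumulative(t - 1);
# prepending 0 makes week 1's incident count equal its cumulative count
inc_counts <- diff(c(0, cum_counts))
inc_counts
## [1] 10 15  0 15 30
```

Any missing week breaks this subtraction, which is why gap-filling is needed first.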

We want to investigate if this incident dataset can be leveraged to gap-fill the cumulative dataset when moving to incident counts, and what the difference between the two is.

2 Data preprocessing

The PLISA PAHO portal only permits downloading the all-countries cumulative dataset one week at a time, by moving the epidemiological week slider while selecting all countries. The downloaded files must be re-encoded and re-saved before further processing.

The required packages are loaded, the files listed, and a new directory created for the processed data.

library(data.table)  # rbindlist()
library(dplyr)
library(tidyr)
library(lubridate)   # ymd(), interval()
library(stringi)     # stri_trans_general()
library(EpiWeek)     # epiweekToDate()
library(zoo)         # na.approx(), na.spline()

file_list <- list.files("raw_data/source_data"); file_list

old_dir <- "raw_data/source_data"
new_dir <- "raw_data/formatted_data"
dir.create(new_dir, showWarnings = FALSE, recursive = TRUE)

#Re-encode each file from UCS-2LE and save it again

for (j in seq_along(file_list)) {

  oldfiledir <- paste0(old_dir, "/", file_list[j])
  newfiledir <- paste0(new_dir, "/", file_list[j])

  inConn  <- file(oldfiledir, encoding = "UCS-2LE")
  outConn <- file(newfiledir)

  writeLines(readLines(inConn), outConn)
  close(inConn)
  close(outConn)

}

csv_files <-list.files(new_dir, full.names=T, recursive=F, pattern="\\.csv$") #directory names for loading files

csv_files <- csv_files[-c(1)];csv_files
temp <- lapply(csv_files, read.table, sep="\t", colClasses = rep("character",16), header=TRUE)

merged_dt <- rbindlist(temp) # avoid naming this "merge", which masks base::merge
#Column rename and select
all_dt <- merged_dt %>% 
  select(Country.or.Subregion, Serotype, Year, Confirmed, Deaths, 
         Epidemiological.Week..a., Severe.Dengue..d., Total.of.Dengue.Cases..b.) %>% 
  rename(adm_0_name = Country.or.Subregion,
         year = Year, 
         epidemiological_week = Epidemiological.Week..a.,
         severe_dengue_cumulative = Severe.Dengue..d.,
         dengue_total_cumulative = Total.of.Dengue.Cases..b.)  %>%
  filter(!epidemiological_week %in% c("-")) # * AL: why does it have some data with epidemiological week being "-" here? 

#convert epidemiological week and year to numeric
all_dt$epidemiological_week <- as.numeric(all_dt$epidemiological_week)
all_dt$year <- as.numeric(all_dt$year)

#remove duplicates
all_dt <- all_dt %>% 
  distinct()%>%
  arrange(adm_0_name, year, epidemiological_week)

#add the end date of the cumulative count for each row ("cum_end_date"). 
for(i in 1:nrow(all_dt)){
 all_dt$cum_end_date[i] <- as.character(epiweekToDate(all_dt$year[i], all_dt$epidemiological_week[i])$d1)

} #*AL: maybe try "lubridate::epiweek()" here? it's faster. 
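This row-wise loop is slow over ~24,000 rows. Note that lubridate::epiweek() maps dates to week numbers, i.e. the opposite direction to what is needed here; a vectorised base-R alternative for the week-to-end-date direction might look like the sketch below (epiweek_end is a hypothetical helper, assuming the CDC/MMWR definition used by epiweekToDate: week 1 is the Sunday-Saturday week containing 4 January):

```r
# Vectorised end date of an epidemiological week (CDC/MMWR definition)
epiweek_end <- function(year, week) {
  jan4 <- as.Date(paste0(year, "-01-04"))
  # $wday: 0 = Sunday; step back to the Sunday that starts week 1
  week1_start <- jan4 - as.POSIXlt(jan4)$wday
  week1_start + 7 * week - 1
}

epiweek_end(2014, 1)   # week 1 of 2014 ends on a Saturday
## [1] "2014-01-04"
epiweek_end(2022, 52)
## [1] "2022-12-31"
```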

#add cum_start_date based on the cum_end_date of the previous row. This is done by lagging cum_end_date because there may be cases where the cumulative count is not reported for consecutive weeks.
all_dt <- all_dt %>%
  arrange(adm_0_name, year, epidemiological_week)%>%
  group_by(adm_0_name, year)%>%
  mutate(cum_start_date = ymd(lag(cum_end_date))+1)

#create "complete" dataset which has a row for every epidemiological week for every year, with NA for the dengue_total count if no data reported for this week. Make sure number of epidemiological weeks in each year is correct according to epidemiological calendars for each year (some have 52 weeks, others have 53).
all_dt <- all_dt %>% ungroup()%>%
  tidyr::complete(adm_0_name, year, epidemiological_week)%>%
  filter(!(year %in% c(2015:2019, 2021:2022) & epidemiological_week == 53))  

all_dt %>%
  group_by(adm_0_name) %>% tally()

#add cum_end_date again for rows that were missed from the original data but were filled by 'complete' function above. Add the start/end date of epidemiological week for all rows("calendar_start_date" and "calendar_end_date"). 

for(i in 1:nrow(all_dt)){
  all_dt$calendar_end_date[i] <- as.character(epiweekToDate(all_dt$year[i], all_dt$epidemiological_week[i])$d1)

  if (!is.na(all_dt$cum_end_date[i])) next
  all_dt$cum_end_date[i] <- as.character(epiweekToDate(all_dt$year[i], all_dt$epidemiological_week[i])$d1)

}

# add cum_start_date, calendar_start_date and source_id
all_dt <- all_dt %>%
  mutate(cum_end_date = ymd(cum_end_date))%>%
  mutate(cum_start_date = ifelse(is.na(cum_start_date), as.character(ymd(cum_end_date)-6), paste0(cum_start_date)))%>%
  mutate(cum_interval = interval(cum_start_date, cum_end_date) %/% weeks(1)+1)%>%
  mutate(calendar_end_date = ymd(calendar_end_date))%>%
  mutate(calendar_start_date = ymd(calendar_end_date)-6)%>%
  mutate(source_id= ifelse(is.na(dengue_total_cumulative)==FALSE, paste0("all_PAHO_countries_PAHO_2014_to_2022_EW",epidemiological_week), NA))
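To illustrate cum_interval, suppose (hypothetically) that a country skipped reporting in week 6 of 2022, so the week-7 cumulative count spans two epidemiological weeks. The lubridate expression above is equivalent to counting whole weeks between the two dates; a base-R sketch with assumed 2022 epi-week dates:

```r
# Week 5 of 2022 ends 2022-02-05 and week 7 ends 2022-02-19 (assumed
# CDC epi-week dates); week 6 was not reported.
cum_start_date <- as.Date("2022-02-06")  # day after week 5's end date
cum_end_date   <- as.Date("2022-02-19")  # end date of week 7
cum_interval   <- as.numeric(cum_end_date - cum_start_date + 1) / 7
cum_interval
## [1] 2
```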

#summary(all_dt$cum_interval)
#administrative unit codes
all_dt$adm_0_name <- stri_trans_general(all_dt$adm_0_name,"Latin-ASCII")

gaul_codes_world <- read.csv(file="ref_data/gaul_codes_world.csv")

gaul_codes_world <- gaul_codes_world %>% 
  select(ADM0_NAME, ADM0_CODE) %>% 
  rename(adm_0_name = ADM0_NAME)%>%
  rename(adm_0_code = ADM0_CODE)%>%
  distinct()

#plyr::count(all_dt$adm_0_name) #52 countries
#plyr::count(gaul_codes_world$adm_0_name) #273 countries

ctry_check <- merge(all_dt, gaul_codes_world, by="adm_0_name", all.x=T)

#check the country names that did not match with gaul names
ctry_check %>% group_by(adm_0_name, adm_0_code)%>% tally() %>% filter(is.na(adm_0_code))

ctry_lookup <- c(
  # gaul name to paho name
"Turks and Caicos islands" = "Turks and Caicos Islands",
"British Virgin Islands" = "Virgin Islands (UK)",
"United States Virgin Islands" = "Virgin Islands (US)"
)
# PAHO countries without gaul codes 
#"Bonaire, Saint Eustatius and Saba" 
#"Curacao"
#"Saint Barthelemy", 
#"Saint Martin", * is this the same country as Sint Maarten?
#"Sint Maarten", 

gaul_codes_world <- gaul_codes_world %>%
  mutate(adm_0_name = recode(adm_0_name, !!!ctry_lookup))

all_dt <- merge(all_dt, gaul_codes_world, by="adm_0_name", all.x=T)

Each data set is then processed into two data hierarchies, primary (A) and secondary (B), with calendar dates added. Epidemiological week and year remain, as these are useful for processing later on.

The case counts are left as cumulative counts at this stage, which may be useful for users who prefer to work with cumulative counts. Here, the calendar start date is fixed at the beginning of epidemiological week one of each year, and the calendar end date is moved forward to account for the cumulative count.

#select columns for master table_A and change to date format
all_dt_A <- all_dt %>% 
  select(adm_0_name, adm_0_code, year, epidemiological_week,
         calendar_start_date, calendar_end_date, 
         cum_start_date, cum_end_date, cum_interval,
         dengue_total_cumulative, source_id 
         )%>% 
  mutate(calendar_start_date = ymd(calendar_start_date),
         calendar_end_date = ymd(calendar_end_date), 
         cum_start_date = ymd(cum_start_date), 
         cum_end_date = ymd(cum_end_date))

all_dt_A_tidy <- all_dt_A %>% 
  distinct() %>% 
  mutate_at(c(2), ~replace(., is.na(.), 0)) %>% 
  group_by(adm_0_name, year) %>% 
  arrange(adm_0_name, year, epidemiological_week)

#Check admin codes and fill gaps
admin_NA <- all_dt_A_tidy %>% 
  dplyr::filter(adm_0_code==0)

# *AL: please explain this part - why adm_0_code assigned for Virgin islands only? 
#all_dt_A_tidy$adm_0_code[all_dt_A_tidy$adm_0_name=="Virgin Islands (UK)"]<-"39"
#all_dt_A_tidy$adm_0_code[all_dt_A_tidy$adm_0_name=="Virgin Islands (US)"]<-"258"


write.csv(all_dt_A_tidy,"processed_data/all_dt_A.csv", row.names = FALSE)
#Dataset B/secondary dataset, which contains information on disease severity, serotype, and mortality.

all_dt_B <- all_dt %>% 
  select(adm_0_name, adm_0_code, 
         calendar_start_date, calendar_end_date, 
         dengue_total_cumulative, Deaths, Serotype, severe_dengue_cumulative, source_id)%>% 
  rename(dengue_total=dengue_total_cumulative,
         severe_dengue=severe_dengue_cumulative)%>%
  mutate(calendar_start_date = ymd(calendar_start_date),
         calendar_end_date = ymd(calendar_end_date))%>% 
  filter(!is.na(dengue_total))

all_dt_B_tidy <- all_dt_B %>% 
  distinct() %>% 
  mutate_at(c(2), ~replace(., is.na(.), 0))%>% 
  group_by(adm_0_name) %>% 
  arrange(desc(calendar_start_date))

admin_NA <- all_dt_B_tidy %>% 
  dplyr::filter(adm_0_code==0)

#all_dt_B_tidy$adm_0_code[all_dt_B_tidy$adm_0_name=="Virgin Islands (UK)"]<-"39"
#all_dt_B_tidy$adm_0_code[all_dt_B_tidy$adm_0_name=="Virgin Islands (US)"]<-"258"

all_dt_B_tidy<-all_dt_B_tidy %>% 
  pivot_longer(c(severe_dengue, dengue_total, Deaths),
               names_to="dengue_classification", values_to="total")

all_dt_B_tidy <- all_dt_B_tidy[!(all_dt_B_tidy$dengue_classification == ""), ]
all_dt_B_tidy <- all_dt_B_tidy %>%  
  select(adm_0_name, adm_0_code, calendar_start_date, calendar_end_date, 
         dengue_classification, total, Serotype, source_id)

write.csv(all_dt_B_tidy,"processed_data/all_dt_B.csv", row.names = FALSE)

3 Data revision and imputation

3.1 Revisions to cumulative count values

Since some cumulative counts may be revised down over time, we need to identify the rows in our dataset that contain such values. To achieve this, we use a for loop. For instance, if the cumulative count for week 5 was initially 100 but revised down to 90 in week 6, we tag week 5 as having revised-down values. The cumulative count for each week is compared with all subsequent rows (e.g., week 6 to week 53).

data$result <- FALSE

#split the dataset by country and year (16 countries * 9 years = 144 datasets) and save them into a list
dt_list <- data %>%
  group_by(adm_0_name, year)%>%
  group_split() 


# iterate over each dataset
for (x in seq_along(dt_list)){
  df <- dt_list[[x]]
  if (nrow(df) < 2) next
  # iterate over each row
  for (i in 1:(nrow(df) - 1)) {
    # if ith row contains NA, skip this row and go to next row
    if (is.na(df$dengue_total_cumulative[i])) next
    
  # iterate over each subsequent row
    for (j in (i + 1):nrow(df)) {
    # skip the row if it's NA and compare with the next row instead
      if (is.na(df$dengue_total_cumulative[j])) next
    
    # compare the ith and jth rows; if the ith row's value is larger than the jth row's, set "result" to TRUE for the ith row. 
      if (df$dengue_total_cumulative[i] > df$dengue_total_cumulative[j]) {
        dt_list[[x]]$result[i] <- TRUE
        # testing for loop 
        # print(paste0("I compared ", i, "th row and ", j,  " row of dt_list ", x))
    } 
    
  }
}
  # the last row is not compared, so set its value to NA
  #dt_list[[x]]$result[nrow(df)] <-NA
}

view(dt_list[[1]])
# merging datasets back
data2 <- rbindlist(dt_list)
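For reference, the same flag can be computed without the nested inner loops: a week should be flagged when any later non-NA cumulative count is smaller than its own. A vectorised base-R sketch for a single country-year series (toy values):

```r
# Toy cumulative series: week 5's count of 100 is revised down to 90
x <- c(20, NA, 50, NA, 100, 90, 120)

# Running minimum of later values (NAs treated as Inf so they never win)
later_min <- rev(cummin(rev(ifelse(is.na(x), Inf, x))))
# shift left by one so each week is compared only with subsequent weeks
later_min <- c(later_min[-1], Inf)

result <- !is.na(x) & x > later_min
result
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
```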

3.2 Zero filling

Where the sum of the weekly counts equals the annual count (assuming the annual count data are correct), NA values were identified and replaced with count values for the following three cases.

3.2.1 Case 1. An entire year without a case

summary(cum_dt$dengue_total_cumulative2) # NA=8679 (decreased from 8978)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       4     440   35908    6259 2363490    8679

3.2.2 Case 2. Filling in values for the weeks after the latest cumulative count

summary(cum_dt2$dengue_total_cumulative2) # NA=7648 (decreased from 8679)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       6     375   33751    5501 2363490    7648

3.2.3 Case 3. Duplicated cumulative counts where NAs exist in between

summary(cum_dt3$dengue_total_cumulative3) #NA 7040 (decreased from 7648)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       6     314   32583    5086 2363490    7040
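One way Case 3 could be implemented: where the last observed cumulative count before a run of NAs equals the first observed count after it, no cases occurred in between, so the gap can be filled with that value. A base-R sketch with a hypothetical fill_flat_gaps() helper and toy values (not the exact code used in this analysis):

```r
# Fill NAs that sit between two identical cumulative counts
fill_flat_gaps <- function(x) {
  for (i in which(is.na(x))) {
    prev_vals <- x[seq_len(i - 1)]
    next_vals <- x[-seq_len(i)]
    prev_obs  <- tail(prev_vals[!is.na(prev_vals)], 1)
    next_obs  <- head(next_vals[!is.na(next_vals)], 1)
    if (length(prev_obs) && length(next_obs) && prev_obs == next_obs)
      x[i] <- prev_obs
  }
  x
}

fill_flat_gaps(c(100, NA, NA, 100, 150))
## [1] 100 100 100 100 150
```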

3.3 Imputation

We first attempted imputation using the ‘na.approx’ function of the ‘zoo’ package. Different imputation approaches will be explored later on.

#create epidemiological week 0 row to permit imputation for weeks where epidemiological week has no reported cumulative count.
data3 <- cum_dt3 %>%
  as_tibble() %>%
  group_by(adm_0_name) %>% 
  group_modify(~ add_row(.x,year=2014:2022, 
                         epidemiological_week = 0, 
                         dengue_total_cumulative3 = 0 ,.before = 0))%>% 
  arrange(adm_0_name, year, epidemiological_week)

data3 <- data3 %>%
  group_by(adm_0_name, year)%>%
  mutate(dengue_na_approx = as.integer(na.approx(dengue_total_cumulative3, na.rm=FALSE, rule = 2, maxgap=2)))%>%
    mutate(dengue_na_spline = as.integer(na.spline(dengue_total_cumulative3, na.rm=FALSE, maxgap=2, method="hyman")))%>%
  ungroup()

# The maxgap=2 argument specifies that the function may fill runs of up to two consecutive missing values. The rule=2 argument extends the nearest observed value to leading/trailing NAs (as in approx()), rather than returning NA at the ends. 
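As a concrete illustration of the linear case, na.approx() behaves like base approx() over the row index. A toy base-R sketch (made-up values) showing both the internal interpolation and the rule = 2 end behaviour:

```r
# Toy series with leading, internal and trailing NAs
x <- c(NA, 10, NA, 30, NA)
idx <- seq_along(x)

# Linear interpolation at missing positions; rule = 2 carries the
# nearest observed value to the ends instead of returning NA there
filled <- approx(idx[!is.na(x)], x[!is.na(x)], xout = idx, rule = 2)$y
filled
## [1] 10 10 20 30 30
```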

summary(data3$dengue_total_cumulative3) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       4     256   31730    4792 2363490    7040
summary(data3$dengue_na_approx)# NA 6823 (decreased from 7040)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       4     270   31436    4825 2363490    6823

3.4 Threshold setting for imputation

# Extract the number of consecutive imputed values for each year and each country
consec <- data.frame(unclass(rle(is.na(data3$dengue_total_cumulative3))))
#  The 'lengths' column indicates the number of consecutive rows and the 'values' column indicates whether cumulative counts are NAs (values==TRUE) or not.

# Highlight those year & epidemiological weeks where there are >= 6 imputed values consecutively, so that these can be discarded moving forward (as we have less confidence in imputed values over periods of longer than 6 weeks)
consec <- consec %>%
  mutate(values2 = ifelse(lengths < 6 & values == TRUE, FALSE, values))

consec <- as.data.frame(lapply(consec, rep, consec$lengths))

data3 <- cbind(data3, consec)
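A toy illustration of this rle() expansion, using a hypothetical 6-week pattern of two observed values, three NAs, then one observed value:

```r
miss <- is.na(c(5, 8, NA, NA, NA, 2))
runs <- data.frame(unclass(rle(miss)))
runs
##   lengths values
## 1       2  FALSE
## 2       3   TRUE
## 3       1  FALSE

# repeat each run by its own length to get one row per original position
expanded <- as.data.frame(lapply(runs, rep, runs$lengths))
nrow(expanded)
## [1] 6
```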

Compare the original raw data, the revised data, and the imputed data. The black dotted line indicates original raw values that were excluded because they were later revised down, the red line indicates the revised and gap-filled data (not yet imputed), and the blue line shows the imputed data.

3.5 Convert cumulative to incident counts

Prepare the dataset for converting cumulative to incident counts. This is more straightforward for data with consecutive epidemiological weeks, as a lag column can be created and subtracted from the current week’s count.

## # A tibble: 7 × 3
## # Groups:   source_cat [4]
##   source_cat source_cat2     n
##   <chr>      <chr>       <int>
## 1 Imputed    Imputed       229
## 2 NA         Revised-NA    151
## 3 NA         <NA>         6672
## 4 Raw        Raw         15462
## 5 Zero       Zero1         287
## 6 Zero       Zero2        1031
## 7 Zero       Zero3         608

Of 24,440 rows, 151 rows have been replaced with NAs following the ‘Revised down’ stage. In total 2,155 gaps have been filled following the ‘Zero filling’ and ‘Imputation’ stage.

The following is an example comparing different imputation methods for Panama and Argentina. The na.spline() function gives better imputation results than na.approx() when converted to incidence counts, so only na.spline() will be included in our final dataset.

##  [1] "adm_0_name"               "adm_0_code"              
##  [3] "year"                     "epidemiological_week"    
##  [5] "calendar_start_date"      "calendar_end_date"       
##  [7] "cum_start_date"           "cum_end_date"            
##  [9] "cum_interval"             "dengue_total_cumulative" 
## [11] "dengue_total_cumulative3" "dengue_na_approx2"       
## [13] "dengue_na_spline2"        "result"                  
## [15] "source_id"                "source_id2"              
## [17] "source_cat"               "source_cat2"             
## [19] "dengue_total_lag"         "dengue_total_inc"        
## [21] "dengue_na_approx_lag"     "dengue_na_approx_inc"    
## [23] "dengue_na_spline_lag"     "dengue_na_spline_inc"

4 Visualisations

4.1 Cumulative plot by country

4.2 Incidence plot by country

4.3 Updated data coverage

Let’s see how the data coverage has been updated so far. The figure below shows the weekly data coverage (2014-2022) by source (raw, imputed, revised, not available) and country. Imputation filled in gaps, whereas revision may have created more. NA refers to cases where the original raw data were not available; Revised-NA refers to cases where the original raw data were available but were replaced with NAs because they were revised down over time.

So in summary, what changes have been made to the original raw datasets?